Antipattern: Reference Non-grouped Columns

Let’s explore how misconception about the GROUP BY clause causes an antipattern by testing out different queries.

The root cause of this antipattern is simple, and it reveals a common misconception that many programmers have about how grouping queries work in SQL.

The single-value rule#

The rows in each group consist only of items that have the same value in the column or columns we name after GROUP BY. For example, in the following query, there is a one-row group for each distinct value in product_id.

Retrieving the latest bug reported for each product

Every column in the select list of a query must have a single-value row per row group. This is called the Single-Value Rule. Columns named in the GROUP BY clause are guaranteed to be exactly one value per group, no matter how many rows the group matches.

The MAX() expression is also guaranteed to result in a single value for each group: the highest value found in the argument of MAX() from among all the rows in the group.

However, the database server can’t be as certain about any other column named in the select list. It can’t always guarantee that the same value occurs in every row in a group for the other columns.

Retrieving the latest bug reported for each product along with the ID

In this example, there are many distinct values for bug_id for a given product_id because the BugsProducts table associates multiple bugs to a given product. There’s no way to represent all the values of bug_id in a grouping query that reduces it to a single row per product.

Since there is no guarantee of a single value per group in the “extra” columns, the database assumes that they violate the Single-Value Rule. Most brands of databases report an error if we run a query that tries to return a column other than the columns named either in the GROUP BY clause or as arguments to aggregate functions.

MySQL and SQLite have different behavior from other brands of the database, which we have studied in the Legitimate uses of the antipattern section.

Do-what-I-mean queries#

The common misconception that programmers have is that SQL can guess which bug_id we want in the report because MAX() is used in another column. Most people assume that if the query fetches the greatest value, then other columns will naturally take their value from the same row where the greatest value occurs.

Unfortunately, SQL can’t make this inference in several cases:

  • If two bugs have the exact same value for date_reported and that is the greatest value in the group, which value of bug_id should the query report?

  • If we query for two different aggregate functions — for example, MAX() and MIN() — these probably correspond to two different rows in the group. Which bug_id should the query return for this group?

Retrieving records using two aggregate functions, where bug_id corresponds to two different rows
  • If none of the rows in the table matches the value returned by the aggregate function, what is the value of bug_id? This commonly occurs for the functions AVG(), COUNT(), and SUM().
Trying to retrieve records using aggregate function along with the bug_id

These are examples of why the Single-Value Rule is important. Not every query that fails to follow this rule would produce an ambiguous result, but many do. It would be clever if the database could tell an ambiguous query from an unambiguous one and produce an error only when the data contains ambiguity. But that would not be good for application reliability. It would mean that the same query might be valid or invalid, depending on the state of data.

Synopsis: Ambiguous Groups
Solution: Use Columns Unambiguously
Mark as Completed
Report an Issue